14 research outputs found

    FIN-CLARIN - a humanities research infrastructure with emphasis on language

    Billions of words and thousands of hours of audio and video are needed as material for humanities research, and for language research in particular. Researchers also need tools to refine their own data collections and compare them with public data collections. When a research project ends, storage and distribution sites are needed to make raw data, tools, and research results available and usable. Data, tools, and shared opportunities for use together form a research infrastructure, which makes it possible to verify earlier results and to reach new findings more efficiently, since not everyone has to start from scratch collecting data and building analysis tools.
    Non peer reviewed

    OCR and post-correction of historical Finnish texts

    This paper presents experiments on optical character recognition (OCR) that combine the Ocropy software with data-driven spelling correction based on weighted finite-state methods. Both model training and testing were done on Finnish corpora of historical newspaper text, and the best combination of OCR and post-processing models gives 95.21% character recognition accuracy.
    Peer reviewed
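The abstract reports a character recognition accuracy but does not say how it is computed. As a rough illustration only, the sketch below assumes one common definition: one minus the Levenshtein edit distance between the OCR output and the reference transcription, divided by the reference length. The function names `edit_distance` and `character_accuracy` are hypothetical and are not taken from the paper or from Ocropy.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """Assumed metric: 1 - edit_distance / len(reference)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)
```

Under this definition a hypothesis much longer than the reference can score below zero, so some evaluations clip the value or normalize differently.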

    Morpheme Segmentation Gold Standards for Finnish and English

    This document describes Hutmegs, the Helsinki University of Technology Morphological Evaluation Gold Standard package, which contains gold-standard morphological segmentations for 1.4 million Finnish and 120 000 English words. The Gold Standards comprise surface-string, or allomorph, segmentations of word forms, as well as deep-level, or morpheme, segmentations of the words.
    Non peer reviewed

    Evaluating HeLI with non-linear mappings

    Peer reviewed

    Semantic Domains in Akkadian Text

    The article examines the possibilities offered by language technology for analyzing semantic fields in Akkadian. The data for our research group come from an existing electronic corpus, the Open Richly Annotated Cuneiform Corpus (ORACC). In addition to more traditional Assyriological methods, the article explores two language-technological methods: pointwise mutual information (PMI) and Word2vec.
    Peer reviewed
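For readers unfamiliar with PMI, the score for a word pair is log2(p(w, v) / (p(w) p(v))), so pairs that co-occur more often than chance predicts score above zero. The sketch below is a minimal illustration under stated assumptions, not the article's pipeline: the function name `pmi_scores`, the forward co-occurrence window, and the toy tokens are all hypothetical, and the abstract does not describe any smoothing or frequency thresholds.

```python
import math
from collections import Counter


def pmi_scores(sentences, window=2):
    """PMI for unordered word pairs co-occurring within `window` following tokens.

    `sentences` is a list of token lists. Probabilities are plain relative
    frequencies (an assumption; no smoothing is applied).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, w in enumerate(tokens):
            # Pair each token with the next `window` tokens in the same sentence.
            for v in tokens[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1

    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (w, v), count in pair_counts.items():
        p_pair = count / n_pairs
        p_w = word_counts[w] / n_words
        p_v = word_counts[v] / n_words
        scores[(w, v)] = math.log2(p_pair / (p_w * p_v))
    return scores


# Toy placeholder tokens, not corpus data.
toy = [["a", "b"], ["a", "b"], ["c", "d"]]
scores = pmi_scores(toy, window=1)
```

Raw PMI tends to favor rare pairs, which is why the rarer pair in the toy input scores higher than the more frequent one; practical work often applies a minimum frequency cutoff before ranking.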

    HeLI, a Word-Based Backoff Method for Language Identification

    In this paper we describe the Helsinki language identification method, HeLI, and the resources we created for and used in the 3rd edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2016 workshop. The shared task comprised a total of 8 tracks, of which we participated in 7. The shared task had a record number of participants, with 17 teams providing results for the closed track of test set A. Our system reached 2nd position in 4 tracks (A closed and open, B1 open and B2 open), and in this paper we focus on the methods and data used for those tracks. We describe our word-based back-off method in mathematical notation. We also describe how we selected the corpus we used in the open tracks.
    Peer reviewed